NATS cluster mode and test matrix #76
poelzi wants to merge 16 commits into bluecatengineering:master
Add opt-in clustered DHCP mode with NATS coordination config:
- T001: Add BackendMode enum (standalone/clustered) to wire config with default standalone and normalized accessors on DhcpConfig
- T002: Add NatsConfig, NatsSubjects, NatsSecurityMode structs with configurable subject templates and security mode selection
- T003: Add validate_cluster_config() enforcing required clustered fields (servers, contract_version, non-empty subjects) only when clustered mode is active; standalone validation path unchanged
- T004: Extend CLI config with --backend-mode, --instance-id, and --nats-servers runtime overrides for clustered operation
- T005: Update example.yaml and config_schema.json with clustered mode configuration examples and schema definitions
- T006: Add 16 new config regression tests covering legacy standalone parsing, clustered config validation (valid/invalid), custom subject overrides, security modes, and fixture files

Standalone mode remains default and behaviorally unchanged. All 53 config tests and 7 dora-core tests pass.
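To illustrate the "validate only when clustered" rule from T001 and T003, here is a minimal sketch. It is not the PR's code: the names BackendMode, NatsConfig, DhcpConfig and validate_cluster_config come from the commit message, but every field, the error type, and the exact checks are assumptions made for illustration.

```rust
// Hypothetical sketch: field layout and error strings are assumptions,
// only the type and function names appear in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum BackendMode {
    #[default]
    Standalone,
    Clustered,
}

#[derive(Debug, Clone, Default)]
pub struct NatsConfig {
    pub servers: Vec<String>,
    pub contract_version: String,
    pub subjects: Vec<String>,
}

#[derive(Debug, Clone, Default)]
pub struct DhcpConfig {
    pub backend_mode: BackendMode,
    pub nats: Option<NatsConfig>,
}

/// Enforces the clustered-only fields; standalone configs pass untouched.
pub fn validate_cluster_config(cfg: &DhcpConfig) -> Result<(), String> {
    if cfg.backend_mode == BackendMode::Standalone {
        return Ok(());
    }
    let nats = cfg
        .nats
        .as_ref()
        .ok_or_else(|| "clustered mode requires a nats section".to_string())?;
    if nats.servers.is_empty() {
        return Err("clustered mode requires at least one NATS server".to_string());
    }
    if nats.contract_version.trim().is_empty() {
        return Err("clustered mode requires a contract_version".to_string());
    }
    if nats.subjects.iter().any(|s| s.trim().is_empty()) {
        return Err("subjects must not be empty".to_string());
    }
    Ok(())
}
```

Because the default variant is Standalone, a legacy config without any cluster fields takes the early return and never reaches the NATS checks.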
…ption coordination

Implement the NATS coordination library (libs/nats-coordination) with:
- T007: Crate scaffold with Cargo.toml, module layout, workspace wiring
- T008: Typed models and JSON codecs for LeaseRecord, HostOptionLookup request/response, LeaseSnapshot, CoordinationEvent matching AsyncAPI contract
- T009: Contract-versioned SubjectResolver with configurable templates, default prefix, and placeholder/empty-subject validation
- T010: NatsClient connection manager wrapping async-nats with optional auth modes (none/user_password/token/nkey/tls/creds_file), connection state observability, publish/request helpers with timeout
- T011: LeaseCoordinator with reserve/lease/release/probate/snapshot APIs, revision-aware conflict retry, and degraded-mode blocking
- T012: HostOptionClient with hit/miss/error outcome classification, correlation IDs, and bounded timeout (errors don't block DHCP)
- T013: 59 unit tests covering subject generation, codec round-trips, error classification, timeout/conflict retry, and degraded-mode behavior
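The template handling described in T009 could look roughly like the sketch below. The SubjectResolver name is from the commit message; the `{prefix}` and `{version}` placeholder syntax, the method names, and the error type are assumptions, since the PR's actual template format is not shown here.

```rust
// Hypothetical sketch: the placeholder syntax and method names are assumptions.
pub struct SubjectResolver {
    prefix: String,
    contract_version: String,
}

impl SubjectResolver {
    pub fn new(prefix: &str, contract_version: &str) -> Self {
        Self {
            prefix: prefix.to_string(),
            contract_version: contract_version.to_string(),
        }
    }

    /// Expands a subject template and rejects empty subjects and
    /// placeholders that were not substituted.
    pub fn resolve(&self, template: &str) -> Result<String, String> {
        if template.trim().is_empty() {
            return Err("empty subject template".to_string());
        }
        let subject = template
            .replace("{prefix}", &self.prefix)
            .replace("{version}", &self.contract_version);
        if subject.contains('{') || subject.contains('}') {
            return Err(format!("unresolved placeholder in subject: {subject}"));
        }
        Ok(subject)
    }
}
```

Rejecting leftover braces at resolve time means a mistyped placeholder surfaces as a configuration error instead of as a subject nobody subscribes to.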
…ded mode, and metrics
- T014: Wire backend mode selection in bin/src/main.rs (standalone SQLite vs clustered NATS)
- T015: Refactor leases plugin with LeaseBackend trait, StandaloneBackend, ClusteredBackend
- T016: Strict uniqueness conflict handling with bounded retries
- T017: Degraded-mode: block new allocations on NATS loss, allow known-lease renewals
- T018: Post-outage reconciliation via snapshot refresh
- T019: 7 cluster operational metrics in dora-core/src/metrics.rs
- T020: Integration tests deferred (need NATS test harness from WP08)
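The degraded-mode policy in T017 (no new allocations while NATS is unreachable, renewals of already-known leases still served) can be sketched as follows. LeaseBackend and ClusteredBackend are names from the commit message; the method signatures, the LeaseError type, and the in-memory map are assumptions, and the real trait is presumably async and richer than this.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

// Hypothetical sketch: signatures, error type and storage are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum LeaseError {
    /// Coordination is unavailable, so new allocations are refused.
    Degraded,
    /// No lease is known locally for this client.
    UnknownLease,
}

pub trait LeaseBackend {
    fn allocate(&mut self, client: &str, ip: Ipv4Addr) -> Result<(), LeaseError>;
    fn renew(&self, client: &str) -> Result<Ipv4Addr, LeaseError>;
}

#[derive(Default)]
pub struct ClusteredBackend {
    pub nats_connected: bool,
    known: HashMap<String, Ipv4Addr>,
}

impl LeaseBackend for ClusteredBackend {
    fn allocate(&mut self, client: &str, ip: Ipv4Addr) -> Result<(), LeaseError> {
        // New allocations need cluster-wide agreement, so block them when degraded.
        if !self.nats_connected {
            return Err(LeaseError::Degraded);
        }
        self.known.insert(client.to_string(), ip);
        Ok(())
    }

    fn renew(&self, client: &str) -> Result<Ipv4Addr, LeaseError> {
        // Renewals of leases this node already knows are allowed even when degraded.
        self.known.get(client).copied().ok_or(LeaseError::UnknownLease)
    }
}
```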
…nd enrichment
- T021: New plugin crate plugins/host-option-sync/ with v4/v6 registration
- T022: Host identity resolution (client identifier first, MAC fallback, v6 DUID support)
- T023: Host-option lookup via nats-coordination with correlation IDs and timeout
- T024: Response enrichment with protocol/subnet applicability checks
- T025: Miss/error/timeout fallback behavior with observability events
- T026: Plugin wired into bin/src/main.rs for v4 and v6 pipelines
- T027: Unit tests for hit/miss/error/timeout and option injection
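The v4 identity precedence from T022 (client identifier first, MAC address as fallback) is small enough to sketch directly. The HostIdentity type and the function name are hypothetical, and treating an empty client identifier as absent is an assumption of this sketch.

```rust
// Hypothetical sketch: type and function names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum HostIdentity {
    ClientId(Vec<u8>),
    Mac([u8; 6]),
}

/// Prefers the DHCPv4 client identifier option and falls back to the
/// hardware address when the option is absent or empty (assumed behavior).
pub fn resolve_v4_identity(client_id: Option<&[u8]>, chaddr: [u8; 6]) -> HostIdentity {
    match client_id {
        Some(id) if !id.is_empty() => HostIdentity::ClientId(id.to_vec()),
        _ => HostIdentity::Mac(chaddr),
    }
}
```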
…plugin lazy_static

WP05 (T028-T034):
- Stateful DHCPv6 lease flow (allocate, renew, release, decline)
- DUID+IAID uniqueness key extraction and validation
- Multi-lease support per DUID when IAID differs
- DHCPv6 degraded-mode behavior matching v4 outage policy
- DHCPv6 cluster metrics and tests

CHG-001 (metrics locality):
- Remove centralized cluster/host-option metrics from dora-core/src/metrics.rs
- Add plugins/leases/src/metrics.rs with lazy_static for all cluster v4/v6 metrics
- Add lazy_static metrics inline in plugins/host-option-sync/src/lib.rs
- Update bin/src/main.rs to reference leases::metrics::CLUSTER_COORDINATION_STATE
- Policy: each plugin owns its metrics with lazy initialization
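A minimal sketch of the DUID+IAID uniqueness key from WP05, assuming a plain in-memory table: the same DUID may hold several leases as long as the IAID differs, while a repeated (DUID, IAID) pair is rejected. All type and method names here are hypothetical, and the only validation shown is a non-empty DUID.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

// Hypothetical sketch: names and the validation rule are assumptions.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct V6LeaseKey {
    pub duid: Vec<u8>,
    pub iaid: u32,
}

impl V6LeaseKey {
    /// Builds a key, rejecting an empty DUID.
    pub fn new(duid: &[u8], iaid: u32) -> Option<Self> {
        if duid.is_empty() {
            return None;
        }
        Some(Self { duid: duid.to_vec(), iaid })
    }
}

#[derive(Default)]
pub struct V6LeaseTable {
    leases: HashMap<V6LeaseKey, Ipv6Addr>,
}

impl V6LeaseTable {
    /// Stores a lease; returns the existing address if the key is already taken.
    pub fn allocate(&mut self, key: V6LeaseKey, addr: Ipv6Addr) -> Result<(), Ipv6Addr> {
        if let Some(existing) = self.leases.get(&key) {
            return Err(*existing);
        }
        self.leases.insert(key, addr);
        Ok(())
    }

    /// Counts the leases held by one DUID across all of its IAIDs.
    pub fn leases_for_duid(&self, duid: &[u8]) -> usize {
        self.leases.keys().filter(|k| k.duid.as_slice() == duid).count()
    }
}
```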
…ntining
- Quarantine conflicted IPs via probation instead of retrying same address
- On conflict in reserve_first, loop to allocate a different IP
- Release locally reserved IP on coordination errors to prevent leaks
- Increase MAX_CONFLICT_RETRIES from 3 to 8
- Track conflict state to only increment resolved metric when appropriate
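The quarantine-and-retry loop described in this commit might be sketched as below. reserve_first and MAX_CONFLICT_RETRIES are names from the commit message; the signature, the CoordError type, and the use of a closure in place of the NATS coordination call are assumptions for illustration.

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;

pub const MAX_CONFLICT_RETRIES: usize = 8;

// Hypothetical sketch: the error type and the closure-based coordinator are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum CoordError {
    /// Another node already holds this address.
    Conflict,
    /// Coordination failed for another reason (e.g. timeout).
    Unavailable,
}

/// Picks the first non-quarantined address and asks the cluster to confirm it.
/// On conflict the address is quarantined and a different one is tried;
/// on any other error the attempt is abandoned so nothing stays reserved.
pub fn reserve_first<F>(
    pool: &[Ipv4Addr],
    quarantined: &mut HashSet<Ipv4Addr>,
    mut coordinate: F,
) -> Option<Ipv4Addr>
where
    F: FnMut(Ipv4Addr) -> Result<(), CoordError>,
{
    for _ in 0..MAX_CONFLICT_RETRIES {
        let candidate = pool.iter().copied().find(|ip| !quarantined.contains(ip))?;
        match coordinate(candidate) {
            Ok(()) => return Some(candidate),
            Err(CoordError::Conflict) => {
                quarantined.insert(candidate);
            }
            Err(CoordError::Unavailable) => return None,
        }
    }
    None
}
```

Quarantining the conflicted address, instead of retrying it, is what makes each retry progress to a new candidate; the retry bound keeps a heavily contended pool from stalling a single request.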
|
Hey! This is a lot of code that appears to make substantial changes to how dora works. It would have been better to propose the changes piecemeal in individual issues so some back and forth could take place. I'm assuming it was written with an LLM coding assistant? In any case, thanks for the contribution. |
|
Absolutely, I dislike changes this big too. It is not a fundamental change; it is an alternative backend as a plugin. Very little changed in the standalone code path. Most of the code is actually a nix-based test framework I built for testing all versions with different DHCP clients and a load tester. Part of it is also the DHCPv6 types that were missing. Yes, I developed it mostly with spec-kitty, opus and codex, with multiple rounds of feedback and reviews. Codex did most of the planning, opus the implementation, both did review, then I had complaints 😆 We have the potential to remove ~2k LoC by using the dhcproto crate for the DHCP definitions. I would prefer that. I could remove the memory backend and use sqlite in in-memory mode, or just keep it as it was. It doesn't matter, because the cluster lease is checked against the global db anyway. The only thing that changed in the default path is that the server only reports healthy once the subsystems have started. Previously it reported healthy directly after start, even though the subsystems could have problems. |
|
Ah, I see that the last refactoring moved things around incorrectly. I will shrink this down.
I just meant that it is a substantial change to dora in the general sense, i.e. it adds a huge feature. At present, dora trades performance for simplicity. I talk about that a little in the README. That's why there's no existing in-memory mode, but we do have an open issue for it. Stateful v6 should be separated as well. A distributed mode is something I'm interested in, but I haven't evaluated the alternatives in a serious way. Can you provide some more info on why you chose NATS? I'll have a look at the testing framework; a better test suite is something we could use. I've not really been satisfied with the network namespaces approach we use in the component tests. |
03b8b6a to 5b49a35
|
Sorry for the bad refactoring earlier, that was definitely too much for poor codex's head. Opus 4.6 does a much better job at tasks like that. I like Rust software the most, which is why I'm choosing Rust tools. This server was the closest to what I need, so I implemented the rest. This project is 10 man years and I'm building it alone in a few months. I'm more an architect than a coder, but I can smell bad code and can instruct an army of agents to build what I want. My code got reviewed by opus 4.6 max, codex 4.6 xhigh, kimi 2.5 and GLM-5, and I looked over it myself; every complaint got fixed. Every reviewer found something to complain about, and even weird race conditions got found. My code is matrix tested, and 2 different load tests have shown better performance than the current code. TBH, I don't care if this gets merged. Opus is so good at solving merge conflicts and rebases, and I'm using nix, so I don't care about forks. At some point I just get bored managing multiple integration branches. I will for sure fix the other findings in the code in my branch. I just find it better when projects stay together and things go upstream. The NATS backend is feature gated. |
|
On Sun, Mar 01, 2026 at 01:05:55AM -0800, Daniel Poelzleithner wrote:
> TBH. I don't care if this gets merged.
Please care about the people that review merge requests.
> I just find it better when projects stay together and things go upstream.
Yes, that will bring humankind further.
|
To be clear, this is in the README as an explicit non-goal because simplicity was preferred over "good enough" performance. #63 is open to provide an in-memory lease option. With respect, I appreciate contributions but you've not provided an explanation of how things work. Say nodes hand out conflicting IPs, or there's a network partition preventing communication, or increased latency. |
|
@jsilke the behavior is documented in How the different NATS subjects are used together with their format is also documented there. |
Sorry, that's a big one (lots of test infrastructure code).
I have implemented a full NATS backend that allows running multiple dora servers in the same network in a high-availability (active-active) setup.
It also allows sending host-specific options via a NATS JetStream KV bucket. This will allow easy control in huge cluster environments.
I added a complete nix-based test framework that tests the standalone and NATS versions against an array of clients and creates a matrix report that can be used for long-term test reporting.
The NATS version is also exercised with 2 load generators: the kea one and the new dhcp-loadtest.
It will be easy to add new DHCP clients to the test matrix after that.
The NATS version seems faster than the sqlite standalone version in the test VM.